
Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement

https://doi.org/10.15607/RSS.2023.XIX.030

Gkanatsios, Nikolaos; Jain, Ayush; Xian, Zhou; Zhang, Yunchu; Atkeson, Christopher; Fragkiadaki, Katerina (July 2023, Robotics: Science and Systems 2023)

Language is compositional; an instruction can express multiple relation constraints to hold among objects in a scene that a robot is tasked to rearrange. Our focus in this work is an instructable scene-rearranging framework that generalizes to longer instructions and to spatial concept compositions never seen at training time. We propose to represent language-instructed spatial concepts with energy functions over relative object arrangements. A language parser maps instructions to corresponding energy functions and an open-vocabulary visual-language model grounds their arguments to relevant objects in the scene. We generate goal scene configurations by gradient descent on the sum of energy functions, one per language predicate in the instruction. Local vision-based policies then re-locate objects to the inferred goal locations. We test our model on established instruction-guided manipulation benchmarks, as well as benchmarks of compositional instructions we introduce. We show our model can execute highly compositional instructions zero-shot in simulation and in the real world. It outperforms language-to-action reactive policies and Large Language Model planners by a large margin, especially for long instructions that involve compositions of multiple spatial concepts. Simulation and real-world robot execution videos, as well as our code and datasets, are publicly available on our website: https://ebmplanner.github.io.
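
The goal-generation step described in the abstract (gradient descent on a sum of energy functions, one per predicate) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the energies `left_of` and `above` are hand-written hinge penalties rather than learned functions, the scene is a dictionary of 2D points, the parser output is hard-coded, and gradients are estimated by finite differences. The names `plan_goal`, `total_energy`, and the object labels are hypothetical.

```python
# Toy sketch: compose one energy per predicate and descend on their sum.
# Assumption: hand-written energies stand in for the learned ones in the paper.

def left_of(a, b, margin=0.1):
    """Energy that is zero once `a` sits at least `margin` left of `b` (smaller x)."""
    def energy(pos):
        gap = pos[a][0] - pos[b][0] + margin
        return max(0.0, gap) ** 2
    return energy

def above(a, b, margin=0.1):
    """Energy that is zero once `a` sits at least `margin` above `b` (larger y)."""
    def energy(pos):
        gap = pos[b][1] - pos[a][1] + margin
        return max(0.0, gap) ** 2
    return energy

def total_energy(energies, pos):
    """Sum of all predicate energies for one scene configuration."""
    return sum(e(pos) for e in energies)

def plan_goal(energies, pos, movable, lr=0.1, steps=500, eps=1e-5):
    """Gradient descent on the summed energy, moving only the `movable` objects.

    Gradients are central finite differences, chosen here to keep the
    sketch dependency-free; an autodiff library would normally be used.
    """
    pos = {name: list(xy) for name, xy in pos.items()}  # work on a copy
    for _ in range(steps):
        grads = {}
        for name in movable:
            grad = []
            for axis in range(2):
                original = pos[name][axis]
                pos[name][axis] = original + eps
                e_plus = total_energy(energies, pos)
                pos[name][axis] = original - eps
                e_minus = total_energy(energies, pos)
                pos[name][axis] = original
                grad.append((e_plus - e_minus) / (2 * eps))
            grads[name] = grad
        for name in movable:
            for axis in range(2):
                pos[name][axis] -= lr * grads[name][axis]
    return pos

# Hypothetical instruction: "put the mug left of the bowl and above the plate".
scene = {"mug": (0.5, 0.0), "bowl": (0.0, 0.0), "plate": (0.2, 0.5)}
energies = [left_of("mug", "bowl"), above("mug", "plate")]
goal = plan_goal(energies, scene, movable=["mug"])
```

Because the predicates are combined by summation, a longer instruction only adds terms to `energies`; the optimizer itself does not change. This is the property the abstract relies on for generalizing to unseen compositions, shown here only in toy form.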

